fix minor issues found during KServe demo#127
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughPropagate KIND_CLUSTER and EXTRACT_GO_LOG through deploy tooling, manifests, operator/agent entrypoints, controllers, and job/PV/PVC helpers; make StorageClassName optional in API/CRDs; remove the Pod→PVC mutating webhook and its build/manifest wiring; update examples generation to Pod-focused overlays and change kernel cache mount path to /kernel-caches. ChangesDeploy & Config
Entrypoints & runtime wiring
Controllers & reconcile behavior
Job launch, PV/PVC helpers & predicates
API types & CRDs
Webhook removal & build cleanup
Examples, generation scripts & cleanup
Sequence DiagramsequenceDiagram
participant Makefile
participant Kustomize
participant Operator
participant Agent
participant Controller
participant K8s
Makefile->>Kustomize: prepare-deploy (KIND_CLUSTER=true) patch configMap literals
Operator->>K8s: start manager with KIND_CLUSTER & EXTRACT_GO_LOG
Agent->>K8s: start daemonset with KIND_CLUSTER and kernel-caches hostPath
Operator->>Controller: inject KindCluster + ExtractLogLevel into reconcilers
Agent->>Controller: inject KindCluster + ExtractLogLevel into agent reconcilers
Controller->>K8s: create/list PVs/PVCs/Jobs (LaunchJob includes kindCluster & GO_LOG)
K8s->>Controller: PV/PVC/Job events trigger reconcile flows
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
internal/controller/gkm-agent/common.go (1)
830-933:⚠️ Potential issue | 🟠 MajorThe KIND/manual-PV path does not implement the documented default StorageClass fallback.
The CRD for
storageClassNameis optional and documents that "If not provided, then default Storage Class will be used." However,GetStorageClassName()returns the raw field value without any defaulting logic, and bothCreatePv()andCreatePvc()assign this value directly—including empty strings—to the PV/PVC spec. In Kubernetes, explicitly settingstorageClassName: ""differs from omitting the field; the former disables default selection, while the latter enables it. The result is that omittedstorageClassNamein the CR breaks the documented contract by causing explicit empty-string assignments to the PV and PVC, preventing cluster default storage class usage.Implement either:
- Logic in
GetStorageClassName()or the KIND-path callers to skip settingStorageClassNameif empty, or- A default-resolution step before calling
CreatePv()andCreatePvc()that looks up and assigns the cluster default when the CR field is omitted.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/gkm-agent/common.go` around lines 830 - 933, The KIND/manual-PV path is passing an explicit empty string for storageClassName (via (*gkmCache).GetStorageClassName()) into CreatePv and CreatePvc, which prevents Kubernetes from using the cluster default; change the call/site or helper so empty means "omit the field": update the callers here to capture sc := (*gkmCache).GetStorageClassName() and if sc == "" pass a nil/omitted value (or a pointer set to nil) into common.CreatePv and common.CreatePvc, and/or update common.CreatePv/common.CreatePvc to treat an empty storageClassName string by not setting Spec.StorageClassName (leave it nil) instead of writing "" to the PV/PVC; ensure GetStorageClassName need not change if you implement the omit-in-caller or omit-in-Create* behavior.cmd/main.go (1)
239-250:⚠️ Potential issue | 🟠 Major | ⚡ Quick winWire
ExtractLogLevelinto the cluster-scoped reconciler too.
commonClnever receivesextractLogLevel, soClusterGKMCacheextraction jobs will still run with the zero-value log level even though the env/default was resolved above. That leaves half of the new log-level plumbing inactive.Suggested fix
commonCl := gkmOperator.ReconcilerCommonOperator[ gkmv1alpha1.ClusterGKMCache, gkmv1alpha1.ClusterGKMCacheList, gkmv1alpha1.ClusterGKMCacheNode, gkmv1alpha1.ClusterGKMCacheNodeList, ]{ Client: mgr.GetClient(), Scheme: mgr.GetScheme(), NoGpu: noGpu, KindCluster: kindCluster, + ExtractLogLevel: extractLogLevel, ExtractImage: extractImage, CrdCacheStr: "ClusterGKMCache", CrdCacheNodeStr: "ClusterGKMCacheNode", }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/main.go` around lines 239 - 250, The Cluster-scoped reconciler's commonCl initializer is missing the ExtractLogLevel field, so ClusterGKMCache extraction uses the zero-value; update the commonCl composite literal in cmd/main.go to include ExtractLogLevel: extractLogLevel (the same variable resolved earlier) so that the gkmOperator.ReconcilerCommonOperator for ClusterGKMCache receives the log level; ensure the field name matches the struct (ExtractLogLevel) and rebuild to propagate the log level into the ClusterGKMCache reconciler.internal/controller/gkm-operator/common.go (1)
767-784:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
KIND_CLUSTERis only wired into half of the operator path.This
LaunchJobcall now usesr.KindCluster, butmanagePVandPVC()still decides whether to manually create the hostPath PV fromr.NoGpuon Line 614. Non-KINDNO_GPU=truedeployments will still take the KIND-only PV path, so the new flag does not actually separate the two behaviors yet.Suggested fix
- if r.NoGpu { + if r.KindCluster {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/gkm-operator/common.go` around lines 767 - 784, managePVandPVC is still using r.NoGpu to decide creating the hostPath PV so NON-KIND deployments with NO_GPU=true incorrectly follow the KIND path; update managePVandPVC to check the cluster type (r.KindCluster or equivalent) when deciding to create hostPath PVs and/or when calling the KIND-specific helper so that only when r.KindCluster is true will the hostPath PV be created; ensure any logic branches or callers (e.g., where managePVandPVC is invoked and where LaunchJob is called) are consistent and reference r.KindCluster alongside/ instead of r.NoGpu to separate KIND behavior from the NO_GPU flag.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v1alpha1/shared_types.go`:
- Around line 80-84: The spec documents that omitting StorageClassName should
allow the cluster default, but CreatePvc currently always sets StorageClassName
by taking a pointer to the (possibly empty) string; change CreatePvc (the
function that builds the PVC spec) to only set the PVC's StorageClassName
pointer when the input StorageClassName is non-empty (e.g., check if
spec.StorageClassName != "" after trimming whitespace) and leave the field nil
otherwise so Kubernetes will select the default StorageClass; update any helper
that constructs &storageClass to follow the same conditional logic.
In `@config/configMap/configMap.yaml`:
- Line 15: The constant ConfigMapIndexKindCluster is defined with the wrong key
string ("gkm.kind.cluster") causing lookups to miss the YAML key; update the
value of ConfigMapIndexKindCluster to "gkm.kindcluster" so it matches the keys
used in the YAML files (e.g., configMap entries) and ensure any reference to
ConfigMapIndexKindCluster continues to compile and read the corrected key.
- Around line 8-10: The operator Deployment is not injecting EXTRACT_GO_LOG from
the ConfigMap so gkm.extract.log.level is ignored; update the operator manifest
(config/operator/operator.yaml) to add an environment variable named
EXTRACT_GO_LOG in the operator container that sources its valueFrom the
ConfigMap key gkm.extract.log.level (use configMapKeyRef, similar to
NO_GPU/KIND_CLUSTER entries), and remove the dead code in
internal/controller/gkm-operator/configmap.go that reads the extractLogLevel
variable only to log it (delete the unused extractLogLevel lookup and its log
statement) so the value is injected via the manifest instead of being a no-op in
code.
In `@config/configMap/kustomization.yaml`:
- Around line 11-12: Remove the hard-coded gkm.kindcluster=true from the shared
base kustomization (config/configMap/kustomization.yaml) and instead add that
patch/setting to the KIND overlay kustomization
(config/kind-gpu/kustomization.yaml); locate the base configMap entry that
currently sets gkm.kindcluster and delete it so the base uses the default in
configMap.yaml (false), then add a kustomizeConfigMapGenerator/patch or overlay
entry in config/kind-gpu/kustomization.yaml to set gkm.kindcluster=true so only
the KIND/demo overlay enables kind-specific behavior.
In `@Makefile`:
- Around line 299-314: The Makefile uses ifdef for KIND_CLUSTER and NO_GPU which
treats any non-empty value as true; replace those conditionals with explicit
value checks (ifeq ($(KIND_CLUSTER),true) and ifeq ($(NO_GPU),true)) and adjust
the sed invocations so you always generate kustomization.yaml but insert the
correct literal values based on the true/false check (set gkm.kindcluster=true
when KIND_CLUSTER == true else gkm.kindcluster=false; set gkm.nogpu=true when
NO_GPU == true else gkm.nogpu=false), keeping the existing AGENT_IMG and
EXTRACT_IMG substitutions (AGENT_IMG, EXTRACT_IMG, kustomization.yaml.env,
kustomization.yaml) intact.
---
Outside diff comments:
In `@cmd/main.go`:
- Around line 239-250: The Cluster-scoped reconciler's commonCl initializer is
missing the ExtractLogLevel field, so ClusterGKMCache extraction uses the
zero-value; update the commonCl composite literal in cmd/main.go to include
ExtractLogLevel: extractLogLevel (the same variable resolved earlier) so that
the gkmOperator.ReconcilerCommonOperator for ClusterGKMCache receives the log
level; ensure the field name matches the struct (ExtractLogLevel) and rebuild to
propagate the log level into the ClusterGKMCache reconciler.
In `@internal/controller/gkm-agent/common.go`:
- Around line 830-933: The KIND/manual-PV path is passing an explicit empty
string for storageClassName (via (*gkmCache).GetStorageClassName()) into
CreatePv and CreatePvc, which prevents Kubernetes from using the cluster
default; change the call/site or helper so empty means "omit the field": update
the callers here to capture sc := (*gkmCache).GetStorageClassName() and if sc ==
"" pass a nil/omitted value (or a pointer set to nil) into common.CreatePv and
common.CreatePvc, and/or update common.CreatePv/common.CreatePvc to treat an
empty storageClassName string by not setting Spec.StorageClassName (leave it
nil) instead of writing "" to the PV/PVC; ensure GetStorageClassName need not
change if you implement the omit-in-caller or omit-in-Create* behavior.
In `@internal/controller/gkm-operator/common.go`:
- Around line 767-784: managePVandPVC is still using r.NoGpu to decide creating
the hostPath PV so NON-KIND deployments with NO_GPU=true incorrectly follow the
KIND path; update managePVandPVC to check the cluster type (r.KindCluster or
equivalent) when deciding to create hostPath PVs and/or when calling the
KIND-specific helper so that only when r.KindCluster is true will the hostPath
PV be created; ensure any logic branches or callers (e.g., where managePVandPVC
is invoked and where LaunchJob is called) are consistent and reference
r.KindCluster alongside/ instead of r.NoGpu to separate KIND behavior from the
NO_GPU flag.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 7213e779-877f-4c9f-bdbd-ea53d8893f64
📒 Files selected for processing (14)
Makefileagent/main.goapi/v1alpha1/shared_types.gocmd/main.goconfig/agent/gkm-agent.yamlconfig/configMap/configMap.yamlconfig/configMap/kustomization.yamlconfig/crd/bases/gkm.io_clustergkmcaches.yamlconfig/crd/bases/gkm.io_gkmcaches.yamlconfig/operator/operator.yamlinternal/controller/gkm-agent/common.gointernal/controller/gkm-operator/common.gopkg/common/k8s.gopkg/utils/contants.go
Filippe found a few issues while trying to run GKM and KServe in the same cluster. - Requested that agents print the node they are running on in the logs. - The initContainer that gets added to the Job to extract the cache is only supposed to be there for KIND clusters but is triggering off the noGpu flag. Add a KIND Cluster flag that distinguishes between NoGPU and KIND Clusters. - The StorageClassName in the GKMCache and ClusterGKMCache is required when it should have been optional. Move it to optional. - Move the KServe model of Download PV/PVC and Serving PV/PVC. The Serving PVC creates an additional PVC that does not have hash which makes the volume mounting in the workload easier. Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
ef815fa to
6eea7b3
Compare
There was a problem hiding this comment.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/main.go (1)
227-240:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSet
ExtractLogLevelfor the cluster-scoped controller to match namespace-scoped configuration.The cluster-scoped
ClusterGKMCachecontroller (lines 227-240) is missing theExtractLogLevelfield, which is present in the namespace-scoped controller (line 215). Both controllers share the samereconcileCommonOperator()method that usesExtractLogLevelwhen launching extraction jobs. Additionally,agent/main.gosets this field for both namespace and cluster-scoped controllers (lines 194 and 218), confirming that both variants require consistent configuration. AddExtractLogLevel: extractLogLevel,to the cluster-scoped initialization.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/main.go` around lines 227 - 240, The cluster-scoped ReconcilerCommonOperator instance (commonCl for ClusterGKMCache / ClusterGKMCacheNode) is missing the ExtractLogLevel field so reconcileCommonOperator() won't receive the configured log level; add ExtractLogLevel: extractLogLevel, to the commonCl struct literal (the same field used in the namespace-scoped controller and set in agent/main.go) so ClusterGKMCache controllers pass the correct log level when launching extraction jobs.
🧹 Nitpick comments (4)
internal/controller/gkm-agent/namespace_gkmcache_controller.go (1)
171-172: ⚡ Quick winRemove commented code.
The cache-name label selector is commented out. Remove this commented line if the label is no longer used for node-scoped cache behavior.
♻️ Proposed fix
labelSelector := map[string]string{ - //utils.GKMCacheNodeLabelCache: cacheName, utils.GKMCacheLabelHostname: r.NodeName, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/gkm-agent/namespace_gkmcache_controller.go` around lines 171 - 172, Remove the dead/commented selector line so the selector map contains only active labels; specifically delete the commented utils.GKMCacheNodeLabelCache: cacheName line in the selector construction near where utils.GKMCacheLabelHostname is set (using r.NodeName) in namespace_gkmcache_controller.go so only the live label entries remain.internal/controller/gkm-agent/cluster_gkmcache_controller.go (1)
170-171: ⚡ Quick winRemove commented code.
The cache-name label selector is commented out. If this label is no longer needed for the node-scoped cache behavior, remove the commented line entirely.
♻️ Proposed fix
labelSelector := map[string]string{ - //utils.GKMClusterCacheNodeLabelCache: cacheName, utils.GKMCacheLabelHostname: r.NodeName, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/gkm-agent/cluster_gkmcache_controller.go` around lines 170 - 171, Remove the leftover commented label selector line in cluster_gkmcache_controller.go: delete the commented reference to utils.GKMClusterCacheNodeLabelCache so only the active selector entries (e.g., utils.GKMCacheLabelHostname: r.NodeName) remain; ensure no other logic or references rely on utils.GKMClusterCacheNodeLabelCache before removing it.gkm-extract/main.go (1)
63-67: ⚡ Quick winRemove commented code or document why it's disabled.
The
os.Chmodblock is commented out without explanation. If the permission change is no longer needed, remove the commented code entirely. If it's temporarily disabled, add a TODO comment explaining why and when it should be re-enabled.♻️ Proposed fix
- /* - if err := os.Chmod(cacheDir, 0755); err != nil { - log.Info("unable to chmod", "err", err) - } - */🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@gkm-extract/main.go` around lines 63 - 67, The commented os.Chmod(cacheDir, 0755) block should not remain unexplained; either delete the entire commented block or replace it with a short TODO explaining why permission changes are disabled and when to re-enable them (reference the cacheDir variable and the os.Chmod call), e.g., note if the chmod caused failures on certain platforms or is deferred pending a refactor of the cache handling; keep this as a single-line TODO comment directly where the commented block currently sits.pkg/common/k8s.go (1)
65-83: 💤 Low valueLog Delete event at
V(1)to match Update branch.
logger.Info("Delete:", ...)fires for every pod deletion seen by this watch (including ones the predicate ultimately rejects), which is much noisier than the Update branch'sV(1).Info. Consider downgrading to keep log volume bounded in busy clusters.♻️ Suggested change
- logger.Info("Delete:", + logger.V(1).Info("Delete:", "Pod Name", pod.Name, "Pod Namespace", pod.Namespace, "Pod Phase", pod.Status.Phase, "Pod PVC", hasPVC(pod), )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/common/k8s.go` around lines 65 - 83, The DeleteFunc predicate is logging every pod deletion with logger.Info which is noisy; change the call in the DeleteFunc closure to use logger.V(1).Info(...) to match the Update branch and reduce log volume (keep the same fields: "Pod Name", pod.Name, "Pod Namespace", pod.Namespace, "Pod Phase", pod.Status.Phase, "Pod PVC", hasPVC(pod)). Ensure you update the logging invocation attached to the "pod-predicate" logger inside the DeleteFunc that references nodeName and hasPVC(pod).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/generate-files.sh`:
- Line 459: The script uses pushd ${VARIANTS_DIR_PODS} without checking for
failure; update the pushd invocation around the VARIANTS_DIR_PODS usage so it
verifies the directory change succeeded and aborts (or exits non‑zero) with a
clear error message if pushd fails (e.g., test directory exists or append a
conditional that logs and exits on failure). Apply the same pattern to the other
pushd calls that use the VARIANTS_DIR_* variables to ensure subsequent commands
never run in the wrong directory.
- Around line 505-507: Quote all shell variables in the kustomize commands to
prevent word-splitting and issues with spaces/special characters: update the
three invocations that use KUSTOMIZE, OVERLAY_DIR_SCOPE, OUTPUT_FILENAME and
OVERLAY_DIR_PODS so each variable is wrapped in double quotes (e.g.,
"$KUSTOMIZE" "$OVERLAY_DIR_SCOPE" "$OUTPUT_FILENAME" "$OVERLAY_DIR_PODS") and
ensure the redirections and append (>>) still operate on the quoted
OUTPUT_FILENAME.
In `@internal/controller/gkm-operator/common.go`:
- Around line 628-682: The PV-creation workaround is gated by the wrong flag:
the conditional currently checks r.NoGpu but should check r.KindCluster so
manual hostPath PVs are only created for KIND clusters; replace the if r.NoGpu {
... } block condition with if r.KindCluster { ... } (update any related
log/comment near the block if desired) and leave all calls to common.PvExists
and common.CreatePv, and assignments to
pvcStatus.PvName/pvCreated/updated/updateReason unchanged.
- Around line 998-1015: The switch handling latest PVC condition uses the wrong
counter for GkmCondOutdated: instead of incrementing the per-namespace counter
(cnts.PodOutdatedCnt) it increments gkmCacheStatus.Counts.PodOutdatedCnt
(causing double-counting and leaving namespaceCnts[ns].PodOutdatedCnt at zero).
Update the case for gkmv1alpha1.GkmCondOutdated in the block that calls
gkmv1alpha1.GetLatestConditionType(...) to increment cnts.PodOutdatedCnt (not
gkmCacheStatus.Counts.PodOutdatedCnt); ensure you do not add to the cache-level
PodOutdatedCnt here since that aggregate is already updated earlier, and keep
references to namespaceCnts and managePVandPVC behavior intact.
- Around line 228-240: The comment block describing PvcOwner behavior contains
two typos causing the codespell pipeline failure: change "downlaod" to
"download" and "refeneces" to "references" in the paragraph that mentions the
Serving PVC referencing the path used in the download PVC (the block discussing
AccessMode/ReadWriteOnce, Serving PV/PVC and download PVC); update those words
in internal/controller/gkm-operator/common.go so the wording is correct and the
pre-commit/codespell check passes.
- Around line 1264-1274: Rename the misspelled helper function deterineOwner to
determineOwner and update all call sites (including the caller that currently
invokes deterineOwner) to use the new name; inside the function (which currently
returns gkmv1alpha1.PvcOwnerAgent unless corev1.ReadOnlyMany is found) add a
check so that corev1.ReadWriteOncePod is treated the same as corev1.ReadOnlyMany
and sets pvcOwner to gkmv1alpha1.PvcOwnerOperator when present; ensure the
function signature and any imports/usage remain consistent with
gkmv1alpha1.PvcOwner and corev1.PersistentVolumeAccessMode.
In `@pkg/common/k8s.go`:
- Around line 368-402: labels.Parse error is never checked and the later "if err
:= objClient.List(...)" shadows err; ensure you validate the parse error
immediately and avoid shadowing for the List call. Change "labelSelector, err :=
labels.Parse(...)" to keep err and add "if err != nil { return nil, err }" right
after; then call List using assignment (err = objClient.List(...); if err != nil
{ return nil, err }) instead of ":=" so you don't shadow; update the final
return to "return failed, nil" (or "return failed, err" only when err is
non-nil) and apply the same fixes to GetGkmPvcList.
- Around line 239-245: The HostPath type used when constructing the PV
(HostPathVolumeSource.Type set to corev1.HostPathDirectory) differs from the
agent manifest which expects DirectoryOrCreate, causing PV creation to fail if
/kernel-caches is missing; update the PV creation to use
corev1.HostPathDirectoryOrCreate for HostPathVolumeSource.Type (or alternatively
ensure the directory is pre-created during cluster setup), referencing the
MountPath constant in pkg/utils/constants.go and the HostPathVolumeSource/Type
assignment in the PV creation code to locate and change the value.
In `@pkg/utils/contants.go`:
- Line 63: The example pod YAMLs still hardcode the old mount path "/cache"
while the Go constant MountPath now uses "/kernel-caches"; update the three
example files (examples/base/pods/pod-1.yaml, pod-2.yaml, pod-3.yaml) to set
mountPath: /kernel-caches (or replace occurrences of "/cache" with
"/kernel-caches") so they match the MountPath constant used by the codebase.
---
Outside diff comments:
In `@cmd/main.go`:
- Around line 227-240: The cluster-scoped ReconcilerCommonOperator instance
(commonCl for ClusterGKMCache / ClusterGKMCacheNode) is missing the
ExtractLogLevel field so reconcileCommonOperator() won't receive the configured
log level; add ExtractLogLevel: extractLogLevel, to the commonCl struct literal
(the same field used in the namespace-scoped controller and set in
agent/main.go) so ClusterGKMCache controllers pass the correct log level when
launching extraction jobs.
---
Nitpick comments:
In `@gkm-extract/main.go`:
- Around line 63-67: The commented os.Chmod(cacheDir, 0755) block should not
remain unexplained; either delete the entire commented block or replace it with
a short TODO explaining why permission changes are disabled and when to
re-enable them (reference the cacheDir variable and the os.Chmod call), e.g.,
note if the chmod caused failures on certain platforms or is deferred pending a
refactor of the cache handling; keep this as a single-line TODO comment directly
where the commented block currently sits.
In `@internal/controller/gkm-agent/cluster_gkmcache_controller.go`:
- Around line 170-171: Remove the leftover commented label selector line in
cluster_gkmcache_controller.go: delete the commented reference to
utils.GKMClusterCacheNodeLabelCache so only the active selector entries (e.g.,
utils.GKMCacheLabelHostname: r.NodeName) remain; ensure no other logic or
references rely on utils.GKMClusterCacheNodeLabelCache before removing it.
In `@internal/controller/gkm-agent/namespace_gkmcache_controller.go`:
- Around line 171-172: Remove the dead/commented selector line so the selector
map contains only active labels; specifically delete the commented
utils.GKMCacheNodeLabelCache: cacheName line in the selector construction near
where utils.GKMCacheLabelHostname is set (using r.NodeName) in
namespace_gkmcache_controller.go so only the live label entries remain.
In `@pkg/common/k8s.go`:
- Around line 65-83: The DeleteFunc predicate is logging every pod deletion with
logger.Info which is noisy; change the call in the DeleteFunc closure to use
logger.V(1).Info(...) to match the Update branch and reduce log volume (keep the
same fields: "Pod Name", pod.Name, "Pod Namespace", pod.Namespace, "Pod Phase",
pod.Status.Phase, "Pod PVC", hasPVC(pod)). Ensure you update the logging
invocation attached to the "pod-predicate" logger inside the DeleteFunc that
references nodeName and hasPVC(pod).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 589f0daa-7a11-4079-8588-1ca11d267c0f
📒 Files selected for processing (34)
Containerfile.gkm-operatorMakefileagent/main.goapi/v1alpha1/clustergkmcachenode_types.goapi/v1alpha1/gkmcachenode_types.goapi/v1alpha1/shared_types.gocmd/main.goconfig/agent/gkm-agent.yamlconfig/configMap/configMap.yamlconfig/configMap/kustomization.yamlconfig/crd/bases/gkm.io_clustergkmcachenodes.yamlconfig/crd/bases/gkm.io_clustergkmcaches.yamlconfig/crd/bases/gkm.io_gkmcachenodes.yamlconfig/crd/bases/gkm.io_gkmcaches.yamlconfig/operator/operator.yamlconfig/webhook/manifests.yamlexamples/.gitignoreexamples/base/pods/kustomization.yamlexamples/base/pods/pod-1.yamlexamples/base/pods/pod-2.yamlexamples/base/pods/pod-3.yamlexamples/cleanup-files.shexamples/generate-files.shexamples/variants/pods/kustomization.envgkm-extract/main.gointernal/controller/gkm-agent/cluster_gkmcache_controller.gointernal/controller/gkm-agent/common.gointernal/controller/gkm-agent/namespace_gkmcache_controller.gointernal/controller/gkm-operator/cluster_gkmcache_controller.gointernal/controller/gkm-operator/common.gointernal/controller/gkm-operator/namespace_gkmcache_controller.gointernal/webhook/pod_pvc_webhook.gopkg/common/k8s.gopkg/utils/contants.go
💤 Files with no reviewable changes (7)
- api/v1alpha1/gkmcachenode_types.go
- config/crd/bases/gkm.io_gkmcachenodes.yaml
- Containerfile.gkm-operator
- internal/webhook/pod_pvc_webhook.go
- config/webhook/manifests.yaml
- config/crd/bases/gkm.io_clustergkmcachenodes.yaml
- api/v1alpha1/clustergkmcachenode_types.go
✅ Files skipped from review due to trivial changes (2)
- config/configMap/configMap.yaml
- config/operator/operator.yaml
🚧 Files skipped from review as they are similar to previous changes (7)
- config/configMap/kustomization.yaml
- config/crd/bases/gkm.io_clustergkmcaches.yaml
- agent/main.go
- api/v1alpha1/shared_types.go
- Makefile
- config/crd/bases/gkm.io_gkmcaches.yaml
- internal/controller/gkm-agent/common.go
There was a problem hiding this comment.
♻️ Duplicate comments (3)
pkg/common/k8s.go (3)
370-406:⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
labels.Parseerror is unchecked and shadowed by the innerif err := objClient.List(...).The outer
errfromlabels.Parseis never inspected; theif err :=on Line 384 declares a newerrscoped to the block. A parse failure silently proceeds with a possibly empty/invalid selector, which would list all PVs cluster-wide and pass them to upstream cleanup logic. Additionally the finalreturn failed, erron Line 405 returns the original (now-unused) parse error, masking what actually happened. The same bug exists inGetGkmPvcList(Lines 691–710).🔧 Suggested fix
labelSelector, err := labels.Parse( utils.PvLabelCache + "," + utils.PvLabelPvcNamespace, ) + if err != nil { + return nil, err + } pvList := &corev1.PersistentVolumeList{} - if err := - objClient.List( - ctx, - pvList, - client.MatchingLabelsSelector{Selector: labelSelector}, - ); err != nil { + if err := objClient.List( + ctx, + pvList, + client.MatchingLabelsSelector{Selector: labelSelector}, + ); err != nil { return nil, err } @@ - return failed, err + return failed, nilApply analogous changes to
GetGkmPvcList.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/common/k8s.go` around lines 370 - 406, GetGkmPvFailedList currently ignores the error from labels.Parse and then shadows err with the objClient.List call; fix by checking the result of labels.Parse immediately and returning the parse error if non-nil (do not reuse the same variable name in a nested short-declaration), then perform objClient.List and handle its error separately (return nil, listErr on failure) so the final return returns the computed failed slice and no stale err; apply the same pattern to GetGkmPvcList (handle labels.Parse error first, avoid shadowing with if err := objClient.List(...), and return the correct errors).
684-711:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winSame unchecked/shadowed
labels.Parseerror pattern asGetGkmPvFailedList.See the consolidated review on Lines 370–406. The parse error here is also discarded (shadowed by the
if err := objClient.List(...)on Line 697), and Line 710 returnserrwhich is the parse error. Apply the same fix.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/common/k8s.go` around lines 684 - 711, In GetGkmPvcList the error returned from labels.Parse (err) is being shadowed by the subsequent if err := objClient.List(...) and the function ends up returning the wrong/ambiguous err; after calling labels.Parse for labelSelector check err immediately and return (e.g., if err != nil { return nil, err }), then call objClient.List using a new variable name (e.g., listErr or reuse err without :=) so you don't shadow the parse error; ensure the final return returns the correct error (or nil) and that variables referenced are labelSelector, err and objClient.List.
227-244:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHostPath
Type: HostPathDirectorywill fail when/kernel-cachesis absent on the node.The PV uses
corev1.HostPathDirectorywhile the agent manifest mounts/kernel-cacheswithDirectoryOrCreate. Pods scheduled on a node where the agent hasn't yet created/kernel-cacheswill fail to bind/mount. Either switch tocorev1.HostPathDirectoryOrCreateor ensure the directory is guaranteed to pre-exist as part of node/cluster setup.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/common/k8s.go` around lines 227 - 244, The PV currently sets HostPath.Type to corev1.HostPathDirectory for the HostPath in the PersistentVolumeSpec (HostPathVolumeSource with Path "/kernel-caches"), which will fail if the directory doesn't exist on the node; change the HostPath.Type to corev1.HostPathDirectoryOrCreate in the HostPathVolumeSource (or alternatively ensure the node setup always pre-creates "/kernel-caches") so mounts succeed when the agent uses DirectoryOrCreate in its Pod spec.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@pkg/common/k8s.go`:
- Around line 370-406: GetGkmPvFailedList currently ignores the error from
labels.Parse and then shadows err with the objClient.List call; fix by checking
the result of labels.Parse immediately and returning the parse error if non-nil
(do not reuse the same variable name in a nested short-declaration), then
perform objClient.List and handle its error separately (return nil, listErr on
failure) so the final return returns the computed failed slice and no stale err;
apply the same pattern to GetGkmPvcList (handle labels.Parse error first, avoid
shadowing with if err := objClient.List(...), and return the correct errors).
- Around line 684-711: In GetGkmPvcList the error returned from labels.Parse
(err) is being shadowed by the subsequent if err := objClient.List(...) and the
function ends up returning the wrong/ambiguous err; after calling labels.Parse
for labelSelector check err immediately and return (e.g., if err != nil { return
nil, err }), then call objClient.List using a new variable name (e.g., listErr
or reuse err without :=) so you don't shadow the parse error; ensure the final
return returns the correct error (or nil) and that variables referenced are
labelSelector, err and objClient.List.
- Around line 227-244: The PV currently sets HostPath.Type to
corev1.HostPathDirectory for the HostPath in the PersistentVolumeSpec
(HostPathVolumeSource with Path "/kernel-caches"), which will fail if the
directory doesn't exist on the node; change the HostPath.Type to
corev1.HostPathDirectoryOrCreate in the HostPathVolumeSource (or alternatively
ensure the node setup always pre-creates "/kernel-caches") so mounts succeed
when the agent uses DirectoryOrCreate in its Pod spec.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: ba4ba995-cdf0-4577-a3f9-f5f522bb845c
📒 Files selected for processing (3)
Makefileinternal/controller/gkm-operator/common.gopkg/common/k8s.go
🚧 Files skipped from review as they are similar to previous changes (2)
- Makefile
- internal/controller/gkm-operator/common.go
1e58bc1 to
7fa3a12
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/common/k8s.go (1)
930-947: 💤 Low value
LaunchJobparameter order placesjobLogLevelafterlog logr.Logger.Convention in this file (and Go generally) is to place the logger last. Moving
jobLogLevelbeforelogkeeps the signature idiomatic and consistent with surrounding functions; it's a pure ordering change at all call sites.♻️ Suggested reordering
noGpu bool, kindCluster bool, extractImage string, pvcStatus *gkmv1alpha1.PvcStatus, podTemplate *gkmv1alpha1.PodTemplate, - log logr.Logger, jobLogLevel string, + log logr.Logger, ) error {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/common/k8s.go` around lines 930 - 947, The LaunchJob function signature currently places jobLogLevel after log (log logr.Logger); reorder the parameters so jobLogLevel appears before log to match the file's/logger-last convention (i.e., change LaunchJob(..., podTemplate *gkmv1alpha1.PodTemplate, jobLogLevel string, log logr.Logger) error). Update the LaunchJob declaration and all call sites to pass jobLogLevel in the new position, and run a build to ensure no remaining mismatched argument orders.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/common/k8s.go`:
- Around line 226-232: The PV labels map is using the PVC constant
utils.PvcLabelCacheNamespace which is inconsistent for PV objects; change that
key to utils.PvLabelCacheNamespace in the PV labels map (the map that currently
contains utils.PvLabelCache, utils.PvcLabelCacheNamespace,
utils.PvLabelPvcNamespace, utils.PvLabelNode, utils.PvLabelDigest) so PVs use
only PvLabel* constants and match the selector that expects
utils.PvLabelCacheNamespace.
---
Nitpick comments:
In `@pkg/common/k8s.go`:
- Around line 930-947: The LaunchJob function signature currently places
jobLogLevel after log (log logr.Logger); reorder the parameters so jobLogLevel
appears before log to match the file's/logger-last convention (i.e., change
LaunchJob(..., podTemplate *gkmv1alpha1.PodTemplate, jobLogLevel string, log
logr.Logger) error). Update the LaunchJob declaration and all call sites to pass
jobLogLevel in the new position, and run a build to ensure no remaining
mismatched argument orders.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 4a3c4040-79b7-4249-9a32-0b01e346d831
📒 Files selected for processing (22)
Containerfile.gkm-operatorMakefilecmd/main.goconfig/configMap/kustomization.yamlconfig/webhook/manifests.yamlexamples/.gitignoreexamples/base/access/rwo/ds-1.yamlexamples/base/access/rwo/ds-2.yamlexamples/base/access/rwo/ds-3.yamlexamples/base/access/rwo/kustomization.yamlexamples/base/pods/kustomization.yamlexamples/base/pods/pod-1.yamlexamples/base/pods/pod-2.yamlexamples/base/pods/pod-3.yamlexamples/cleanup-files.shexamples/generate-files.shexamples/variants/access/rwo/kustomization.envexamples/variants/pods/kustomization.envinternal/controller/gkm-operator/common.gointernal/webhook/pod_pvc_webhook.gopkg/common/k8s.gopkg/utils/contants.go
💤 Files with no reviewable changes (8)
- examples/base/access/rwo/ds-2.yaml
- examples/base/access/rwo/ds-3.yaml
- examples/variants/access/rwo/kustomization.env
- config/webhook/manifests.yaml
- examples/base/access/rwo/ds-1.yaml
- examples/base/access/rwo/kustomization.yaml
- Containerfile.gkm-operator
- internal/webhook/pod_pvc_webhook.go
✅ Files skipped from review due to trivial changes (4)
- examples/base/pods/pod-2.yaml
- examples/base/pods/pod-1.yaml
- examples/base/pods/pod-3.yaml
- examples/.gitignore
🚧 Files skipped from review as they are similar to previous changes (7)
- config/configMap/kustomization.yaml
- examples/cleanup-files.sh
- examples/variants/pods/kustomization.env
- Makefile
- pkg/utils/contants.go
- cmd/main.go
- internal/controller/gkm-operator/common.go
7fa3a12 to
3250f20
Compare
Move to the KServe pattern of Download PV/PVC and Serving PV/PVC. The Serving PVC creates an additional PVC that does not have hash which makes the volume mounting in the workload easier. With the DaemonSet requirement removed when Serving PVC was added, the pod mutating webhook is no longer needed. This webhook was used to find the Node specific PVC and update te Pod Spec of the pod launched from a DaemonSet with the value in the volumenMount. Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Filippe found a few issues while trying to run GKM and KServe in the same cluster.
Summary by CodeRabbit
New Features
Bug Fixes / Behavior Changes
Other